An improved Differential evolution with Sailfish optimizer (DESFO) for handling feature selection problem

As a preprocessing step for machine learning and data mining, feature selection plays an important role. Feature selection aims to streamline high-dimensional data by eliminating irrelevant and redundant features, which mitigates the curse of dimensionality in large datasets. When working with datasets containing many features, algorithms that aim to identify the most valuable features to improve classification accuracy may become trapped in local optima. Many studies have been conducted to solve this problem, and one solution is to use meta-heuristic techniques. This paper presents a combination of the Differential Evolution and Sailfish Optimizer algorithms (DESFO) to tackle the feature selection problem. To assess the effectiveness of the proposed algorithm, it is compared with Differential Evolution, the Sailfish Optimizer, and nine other modern optimization algorithms. The evaluation used Random Forest and k-nearest neighbors classifiers as quality measures. The experimental results show that the proposed algorithm outperforms the others. It achieves high classification accuracy, attaining 85.7% with the Random Forest classifier and 100% with the k-nearest neighbors classifier across 14 multi-scale benchmarks. According to fitness values, it gained 71% with the Random Forest and 85.7% with the k-nearest neighbors classifiers.

However, some of these features may be irrelevant, redundant, or noisy. Such characteristics in the dataset could result in over-fitting or create ambiguity in the learning mechanism 1,2.
Feature Selection (FS) is commonly employed as a preprocessing step to improve the accuracy of a classification model. The core objective of FS is to identify the most relevant features that positively impact model performance while discarding irrelevant or harmful features at a minimal cost 3. Various algorithms have been created to identify the most effective set of features that can improve the accuracy of a classification model for a given dataset. When dealing with datasets containing many features, traditional algorithms encounter challenges in identifying the significant features.
There are three types of FS algorithms: filter, wrapper, and embedded. In filter algorithms, the FS process and the classifier model are treated as distinct phases. During the initial phase, specific metrics extract the features that significantly impact the classification process while ignoring the others. In the second phase, only the chosen attributes are used in the classification model. Wrapper algorithms, by contrast, modify the selected feature subsets dynamically, depending on the accuracy of the classifier. The wrapper approach is commonly used in FS; it generates subsets of features using specific search methods and determines their relevance by running a classification algorithm. Embedded algorithms are combined with a classifier to decide which features should be kept or removed from the dataset [4][5][6].
As per reference 7, FS is widely believed to be a combinatorial optimization problem that is most likely NP-complete. Each additional feature doubles the number of candidate subsets, so a dataset with n features has 2^n possible subsets, making it challenging and time-consuming to determine the most efficient subset of features. Additionally, references 8,9 consider the FS problem to be NP-hard. This means that the computational time needed to find the solution grows exponentially with the size of the problem. Hence, researchers have shown a keen interest in meta-heuristic (MH) algorithms 10. Four main categories of MH algorithms excel in solving various optimization problems: Human-based algorithms, Swarm Intelligence (SI) algorithms, Physics-based algorithms (PA), and Evolutionary Algorithms (EA).
Swarms and animal behavioral patterns are the basis for SI algorithms 11. A commonly employed algorithm in optimization problems is Particle Swarm Optimization (PSO). The algorithm is designed based on the collective behaviors of swarm objects. In this approach, every individual object represents a potential solution 12. The concept behind Artificial Fish Swarm (AFS) involves replicating the actions of fish, such as hunting, gathering in groups, and tracking, to perform a localized search of individuals to attain a global optimal solution 13. Bacterial Foraging Optimization (BFO) is a recently developed algorithm that draws inspiration from the foraging behavior of Escherichia coli in the human intestine. It involves competition and cooperation among bacterial populations and is employed as a global random search algorithm 14. Ant Colony Optimization (ACO) is a well-known swarm intelligence algorithm that imitates the foraging behavior of different ant species. In natural settings, ants use chemical pheromones to identify the most optimal path for the colony members to follow 15. A swarm intelligence optimizer known as pigeon-inspired optimization solves air-robot path planning problems. The technique involves using a map and compass operator model based on a magnetic field and the sun, and a landmark operator model that utilizes landmarks 16. The bat algorithm is a metaheuristic algorithm based on the behavior of animal groups or herds. It uses the echolocation behavior of bats to generate solutions for single- or multi-objective domains within a continuous solution space 17. The grey wolf optimizer is an algorithm that imitates the leadership hierarchy and hunting mechanisms of grey wolves in nature and is categorized as a swarm intelligence algorithm 18.
To effectively search a given space, any search algorithm must balance exploring new areas within that space with exploiting already known areas. This means it must balance venturing into uncharted territory and focusing on areas near previously explored locations. By achieving an optimal balance between exploration and exploitation, a search algorithm is more likely to succeed in its search efforts 19.
There have been multiple attempts to understand the mechanism that regulates the equilibrium between exploration and exploitation in search algorithms. However, due to the lack of consistent knowledge, several interesting metrics have been proposed to quantify the level of exploration and exploitation in metaheuristic schemes. These metrics monitor the current diversity of the population and have been suggested in various indexes. Despite several indexes and ongoing proposals, there is yet to be a definitive or objective way to measure the exploration/exploitation rate of metaheuristic algorithms 20. Achieving success with metaheuristic algorithms requires a careful balance between exploration and exploitation throughout the evolutionary process. To achieve this balance more effectively, it is important to optimize the level of exploration and exploitation 21.
Many SI algorithms that show high performance in various optimization problems have been developed in the literature. Some of these algorithms include the Sailfish Optimizer (SFO) 22, Chaotic Coyote Algorithm 23, Modified Social-Spider Optimization Algorithm 24, Cheetah Optimization Algorithm 25, Migrating Birds Optimization 26, Owl Optimization Algorithm 27, Bacterial Foraging Optimization Algorithm 28, and Salp Swarm Algorithm (SSA) 29.
Many metaheuristic algorithms are based on evolutionary behaviors that emulate biological processes such as mutation, crossover, and selection, and they are named EA algorithms. Some of these algorithms include Differential Evolution (DE) 30, Genetic Algorithm (GA) 31, Invasive Tumor Growth Optimizer (ITGO) 32, and Biogeography-Based Optimizer (BBO) 33. These algorithms have shown great efficiency in various optimization applications.
Optimization algorithms that are based on physical laws are called PA algorithms and include Big Bang-Big Crunch (BBBC) 34, Multi-verse Optimizer (MVO) 35, and Gravitational Search Algorithm (GSA) 36.
The main contributions of this paper are as follows:
1. A new algorithm called DESFO has been created by integrating DE and SFO.
2. A V-shaped transfer function (TF) is used to convert position values into binary format.
3. The periodic mode boundary handling (PMBH) approach and a novel local search (LS) strategy are used to improve the exploration and exploitation process.
4. In supervised classification, the DESFO algorithm is used for wrapper feature selection.
5. DESFO's performance is evaluated through metrics such as average fitness rate, average accuracy rate, and average number of selected features.
6. To assess the effectiveness of the suggested DESFO algorithm with the RF and K-NN classification algorithms, Wilcoxon's non-parametric rank-sum test (with a significance level of 5%) is used to compare it with similar algorithms.

Structure
The paper follows the structure outlined below:
1. Section "Related works" reviews the recent state of the art and related works.
2. Section "Preliminary work" provides preliminaries and explanations of the original DE and SFO algorithms.
3. Section "Methodology of the proposed DESFO" introduces the methodology of the proposed DESFO algorithm, along with the related steps.
4. Section "Experimental results and analysis" presents the experimental results of the DESFO algorithm and compares it with other MH algorithms.
5. Section "Conclusion and future works" concludes the paper.

Related works
Numerous research studies have been conducted in feature selection utilizing metaheuristic algorithms. Some of these efforts are outlined below.
Rodrigues et al. 37 introduced a binary cuckoo search algorithm called BCS, which uses a function to convert continuous variables to their binary form to obtain the optimal feature subset. The Optimum Path Forest classifier was used to apply BCS to two datasets related to theft detection in a power system. The results indicated that BCS was the most efficient and appropriate method for solving feature selection issues in industrial datasets while also being the fastest.
In their study, Emary et al. 38 introduced the initial binary edition of the firefly algorithm (FFA) for addressing feature selection issues by utilizing a threshold value. The algorithm exhibited a high level of exploration quality, enabling it to swiftly identify a solution to the problem.
To tackle feature selection problems, Nakamura et al. 39 developed a binary version of BA called BBA. They used a sigmoid function to confine the position of bats to binary variables. They employed the Optimum Path Forest classifier and applied BBA to five datasets to evaluate the accuracy.
Zawbaa et al. 40 proposed a binary version of the ALO algorithm to address the feature selection problem by applying a threshold value to continuous variables. In their study, Emary et al. 41 proposed a binary grey wolf optimizer (bGWO) that employs a sigmoidal transfer function to obtain binary vectors. They evaluated the classification accuracy of these vectors using a K-NN classifier across eighteen distinct UCI datasets. The researchers also utilized small, random, and large initialization methods during the initialization phase to facilitate thorough exploration.
Hussien et al. 42,43 utilized S-shaped and V-shaped transfer functions in conventional WOA to solve binary optimization problems. They also applied this method to solve feature selection problems with eleven UCI datasets. To ensure the relevance of the selected features for classification, they used the K-NN classifier.
In their study, Gad et al. 44 introduced a new version of the sparrow search algorithm. This version uses a combination of random agent repositioning and the LS method to handle feature selection effectively in supervised classification tasks. This approach is particularly useful for choosing the best or nearly optimal subset of attributes from a given dataset while maintaining maximum accuracy rates.
Ghosh et al. 45 have presented a new variant of the Sailfish Optimizer (SFO), called the Binary Sailfish (BSF) optimizer, for solving FS problems. They utilized the sigmoid transfer function to convert the continuous search space of SFO into a binary one. They also incorporated adaptive β-hill climbing (AβHC), a recently proposed meta-heuristic algorithm, with the BSF optimizer to enhance its exploitation ability.
Emrah et al. 46 have proposed a new filter criterion inspired by mutual information, ReliefF, and Fisher Score. Rather than relying on mutual redundancy, this criterion aims to select the most highly ranked features determined by ReliefF and Fisher Score while ensuring mutual relevance between the features and class labels. Based on this new criterion, the team has developed two novel differential evolution (DE) based filter approaches.
Bacanin et al. 47 presented a diversity-oriented social network search to tackle the feature selection problem in detecting phishing websites. The authors aimed to enhance the detection of phishing websites by refining an extreme learning model that leverages the most pertinent subset of features from the phishing websites dataset. A new algorithm was developed and integrated into a two-level cooperative framework to accomplish this. The efficacy of the proposed algorithm was then evaluated and compared against six other state-of-the-art metaheuristic algorithms.
Alrefai et al. 48 proposed an effective method for cancer classification using ensemble learning. The study employed particle swarm optimization and an ensemble learning method for feature selection and cancer classification. The study's findings indicate that the proposed method is effective for cancer classification based on microarray datasets. Furthermore, the accuracy of the proposed method proves its superiority over other methods.
Gomez et al. 49 proposed a new technique called Two-Step Swarm Intelligence. The method involves breaking down the heuristic search carried out by agents into two stages. In the first phase, agents generate partial solutions, which are used as starting states in the second phase. Their study aimed to assess the effectiveness of this approach in resolving the feature selection problem using Ant Colony Optimization and Particle Swarm Optimization. The feature selection is based on the reduct concept in Rough Set Theory. The results demonstrate that the Two-Step Swarm Intelligence method improves the performance of the ACO and PSO metaheuristics regarding computation time and the quality of the reducts produced.
Bezdan et al. 50 proposed an algorithm based on a binary hybrid metaheuristic approach to select the optimal feature subset. Specifically, they combined the brainstorm optimization algorithm with the firefly algorithm to create a wrapper method for feature selection problems on classification data sets. The performance of the proposed algorithm was evaluated on 21 data sets and compared against 11 other metaheuristic algorithms. Additionally, the algorithm was applied to the coronavirus data set.
Gao et al. 51 introduced a Clustering Probabilistic Particle Swarm Optimization (CPPSO) to improve the traditional particle swarm optimization approach. CPPSO incorporates probabilities to represent velocity and an elitism mechanism. Additionally, CPPSO uses the K-means algorithm to cluster the population based on the Hamming distance into two sub-populations, which enhances its performance. The effectiveness of CPPSO is evaluated by comparing it against seven existing algorithms using twenty diverse datasets.
Latha et al. 52 addressed the feature selection problem by implementing grey wolf optimization (GWO) with decomposed random differential grouping (DrnDG-GWO) as a supervised learning technique. The study found that combining supervised machine learning with swarm intelligence techniques yielded the best feature optimization results.

Motivations
Storn et al. 30 proposed the differential evolution (DE) algorithm in 1997, a powerful and straightforward stochastic search method operating on populations.DE is an effective global optimizer for continuous search problems and has been successfully applied in various domains, such as pattern recognition 53 , communication 54 , and mechanical engineering 55,56 .
The Sailfish Optimizer (SFO) is a highly effective optimization algorithm presented in 2019 by Shadravan et al. 22. This algorithm is population-based, and it mimics the hunting behavior of a group of sailfish as they hunt a school of sardines. The strategy employed by the sailfish group involves alternating between attacking a group of sardines and retreating to capture their prey. The SFO algorithm has become popular in the optimization community due to its robustness and effectiveness. In this paper, an algorithm called DESFO that integrates both DE and SFO is proposed. By combining the strengths of both algorithms, the proposed algorithm can attain satisfactory search accuracy, swift convergence speed, and improved stability.
Moreover, it can avoid getting stuck in local optima, an issue that still needs to be systematically addressed for the FS problem. Compared to state-of-the-art meta-heuristic techniques, including the original DE and SFO, the DESFO approach yields superior results by producing optimal or near-optimal outcomes for numerous problems. The proposed feature selection method was tested on 14 benchmarks with multi-scale attributes and records from the UCI machine learning repository. Each experiment was run 30 times to validate its efficacy 57. The average classification accuracy is calculated using two standard machine learning classification algorithms: Random Forest (RF) and k-Nearest Neighbor (k-NN).

Preliminary work
As mentioned in the previous section, meta-heuristics have several benefits, but can existing methods adequately solve the FS problem? The No Free Lunch (NFL) theorem 58 answers this question. This theorem suggests that no single algorithm can perfectly solve all optimization problems. In the case of FS, an algorithm may perform exceptionally well for one dataset but inadequately for another. Therefore, there is still a need for an advanced metaheuristic approach that can efficiently solve almost all possible FS dataset types, which is currently an open research question. The remainder of this section explains the basic DE and SFO algorithms. The two algorithms will be integrated under the DESFO algorithm to optimize the feature selection problem and enhance classification accuracy.

Differential evolution algorithm (DE)
In 1997, Storn et al. 30 introduced the Differential Evolution (DE) algorithm, considered one of the most reliable Evolutionary Algorithms. It is known for its fast convergence, user-friendly nature, and ease of implementation. Additionally, the same set of parameters, such as population size (NP), crossover rate (Cr), and scaling factor (F), can be applied to address various optimization problems. The process begins with a given set of solutions. Then, for each solution vector in the current set, a mutant solution is produced by adding the weighted difference between two candidate solutions to a third candidate solution. DE has proven effective and has been widely applied to optimization problems in different scientific and engineering domains 59. The structure and primary search operators utilized by the DE algorithm are explained as follows:

Mutation
In every epoch (generation G), DE applies a mutation operator to generate a new donor vector, also known as a mutant vector, for each target solution. The mutation operator randomly selects three candidate solutions according to Eq. (1): the donor vector is created by scaling the difference between two of these vectors and then adding the result to the third 30.
$$v_{i,G+1} = x_{r1,G} + F \times (x_{r2,G} - x_{r3,G}) \qquad (1)$$
In this process, three distinct integers r1, r2, and r3 are randomly selected from [1, NP], where NP is a positive integer greater than or equal to four. Additionally, these integers are different from the running index i. The differential variation $x_{r2,G} - x_{r3,G}$ is amplified by a constant factor F, which ranges from 0 to 2.
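The mutation step described above can be sketched in a few lines of Python. This is a minimal illustration of the classic DE/rand/1 rule, not the authors' implementation; the function name `de_mutation`, the list-of-lists population layout, and the default `F=0.5` are assumptions made for the example.

```python
import random

def de_mutation(population, i, F=0.5, rng=random):
    """Build the donor (mutant) vector for the target at index i.

    Assumed layout: population is a list of NP equal-length lists of floats.
    """
    NP = len(population)
    if NP < 4:
        raise ValueError("DE mutation needs at least four individuals")
    # Pick three mutually distinct indices, all different from the running index i.
    candidates = [k for k in range(NP) if k != i]
    r1, r2, r3 = rng.sample(candidates, 3)
    x1, x2, x3 = population[r1], population[r2], population[r3]
    # Donor = base vector + F * (difference of the two other vectors).
    return [a + F * (b - c) for a, b, c in zip(x1, x2, x3)]
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible, which is convenient when repeating an experiment many times.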

Crossover
After mutation, a crossover search operator produces an offspring (trial) vector from the target solution. The exponential and binomial crossover search operators are the most frequently used and the simplest ones. In binomial crossover, each decision variable (DV) j of the trial vector is determined by Eq. (2):
$$u_{i,j,G+1} = \begin{cases} v_{i,j,G+1}, & \text{if } rand(j) \le C_r \text{ or } j = j_{rand} \\ x_{i,j,G}, & \text{otherwise} \end{cases} \qquad (2)$$
where $j_{rand}$ is an index selected at random from the range $[1, N_x]$, $N_x$ is the number of decision variables, and $rand(j)$, the j-th evaluation of a uniform random number generator, lies in [0, 1]. The crossover rate $C_r$ controls the number of variables taken from the donor vector, and the index $j_{rand}$ guarantees that $v_{i,G+1}$ provides at least one parameter to the trial vector $u_{i,G+1}$.

Selection
A selection operator is utilized to determine the better solution by comparing the objective function values of the parent and the offspring. If the offspring has a lower objective function value, it is preserved for the subsequent iterations. If not, the parent vector is retained for that generation. With f denoting the objective function, this is expressed as:
$$x_{i,G+1} = \begin{cases} u_{i,G+1}, & \text{if } f(u_{i,G+1}) < f(x_{i,G}) \\ x_{i,G}, & \text{otherwise} \end{cases} \qquad (3)$$
To determine whether it should join generation G + 1, the trial vector $u_{i,G+1}$ is evaluated against the target vector $x_{i,G}$ using the greedy criterion. If the trial vector $u_{i,G+1}$ results in a lower cost function value than the target vector $x_{i,G}$, then the trial vector replaces the target vector; if not, the original target vector $x_{i,G}$ is kept.
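As an illustration of the crossover and greedy selection steps, the following sketch implements binomial crossover and the parent-versus-offspring comparison for a minimization problem. The function names, the default `Cr=0.9`, and the use of plain Python lists are assumptions for this example rather than details taken from the paper.

```python
import random

def binomial_crossover(target, donor, Cr=0.9, rng=random):
    """Mix target and donor component-wise to form the trial vector."""
    D = len(target)
    # One randomly chosen index always comes from the donor, so the
    # trial vector never ends up as a plain copy of the target.
    j_rand = rng.randrange(D)
    return [
        donor[j] if (rng.random() <= Cr or j == j_rand) else target[j]
        for j in range(D)
    ]

def greedy_selection(target, trial, f):
    """Keep the trial only if it has a strictly lower objective value."""
    return trial if f(trial) < f(target) else target
```

With `Cr` close to 0 the trial vector inherits almost everything from the target, while `Cr = 1` copies the donor outright; intermediate values control how aggressively new information is injected.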

The sailfish optimizer (SFO)
Shadravan et al. 22 developed a unique algorithm called sailfish optimizer (SFO) in 2019, which is based on swarm intelligence and is a population-based algorithm.To devise this technique, the scientists took cues from a pack of predatory sailfish.The approach involves the use of two distinct populations.The sailfish population is responsible for intensifying the search around the current best solution, while the sardine population diversifies the search space.The sailfishes are considered potential solutions, and their positions in the search space represent the problem's variables.The algorithm aims to randomize all search agents' movement (sailfish and sardine) to the greatest extent possible.Sailfishes are dispersed throughout the search space, while the positions of sardines aid in discovering the optimal solution in the search space.
The algorithm identifies the sardine with the best fitness value as the 'injured' fish, with its position denoted as $P^{i}_{SrdInj}$ at the i-th iteration. During each iteration, the positions of both sardines and sailfishes are updated. For the i-th iteration, the position of a sailfish is updated using the 'elite' sailfish $P^{i}_{SlfBest}$ and the 'injured' sardine based on a specific criterion.
The positions of sailfishes and sardines are modified at each iteration. At iteration i+1, the elite sailfish and the injured sardine are used to update the position of a sailfish to a new one, denoted by $P^{i+1}_{Slf}$. The updating is done according to Eq. (4) 37:
$$P^{i+1}_{Slf} = P^{i}_{SlfBest} - \mu_i \times \left( rnd \times \frac{P^{i}_{SlfBest} + P^{i}_{SrdInj}}{2} - P^{i}_{Slf} \right) \qquad (4)$$
where rnd ∈ (0,1) is a random value, and the coefficient $\mu_i$ is generated by Eq. (5):
$$\mu_i = 2 \times rnd \times PrD - PrD \qquad (5)$$
In each iteration, the prey density (PrD), which represents the number of prey available, is determined using Eq. (6). As the number of prey decreases during group hunting, the value of PrD decreases accordingly.
$$PrD = 1 - \frac{N_{Slf}}{N_{Slf} + N_{Srd}} \qquad (6)$$
The numbers of sailfish and sardines are represented by $N_{Slf}$ and $N_{Srd}$, respectively. $N_{Slf}$ can be calculated according to Eq. (7):
$$N_{Slf} = N_{Srd} \times Prcent \qquad (7)$$
Here, Prcent refers to the percentage of the sardine population that constitutes the initial sailfish population. It is also assumed that the initial number of sardines exceeds the number of sailfish.
The positions of the sardines are updated in each iteration according to Eq. (8):
$$P^{i+1}_{Srd} = rnd \times \left( P^{i}_{SlfBest} - P^{i}_{Srd} + ATK \right) \qquad (8)$$
The old position and the updated position of the sardine are represented by $P^{i}_{Srd}$ and $P^{i+1}_{Srd}$, respectively, while ATK represents the power of the sailfish attack at the i-th iteration and can be calculated by Eq. (9):
$$ATK = A \times \left( 1 - 2 \times Itr \times \varepsilon \right) \qquad (9)$$
where, following the original SFO formulation 22, A and ε are coefficients that decrease the attack power linearly from A to 0, and Itr is the current iteration number. ATK is crucial in determining the number of sardines that update their positions and the extent of their displacement. Decreasing ATK can facilitate the convergence of search agents. Based on the ATK parameter, the number of sardines that update their position (γ) and the number of their variables that are updated (δ) are computed using Eqs. (10) and (11):
$$\gamma = N_{Srd} \times ATK \qquad (10)$$
$$\delta = v \times ATK \qquad (11)$$
where $N_{Srd}$ and v denote the number of sardines and the number of variables, respectively. If a sardine surpasses the fitness level of any sailfish, the sailfish adjusts its position to follow that sardine, and the sardine is removed from its population.
To explore the search space effectively, it's important to select both sailfishes and sardines randomly.Sailfishes have a decreasing attack power after each iteration, allowing sardines to escape from the most aggressive sailfish.This helps to balance the exploration and exploitation of the search space.The ATK parameter is used to find the optimal balance between both of them.
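The SFO update rules discussed in this subsection can be sketched as below. The sketch follows the commonly published form of the original Sailfish Optimizer, so the exact expressions, the coefficients `A` and `eps`, and the choice of one scalar random number per update are assumptions and may differ from the variant used in DESFO. All function and parameter names are hypothetical.

```python
import random

def prey_density(n_slf, n_srd):
    """Prey density PrD from the sailfish and sardine counts."""
    return 1.0 - n_slf / (n_slf + n_srd)

def update_sailfish(slf, slf_best, srd_inj, prd, rng=random):
    """Move one sailfish using the elite sailfish and the injured sardine."""
    mu = 2.0 * rng.random() * prd - prd          # coefficient mu_i
    rnd = rng.random()                           # assumed scalar per update
    return [
        b - mu * (rnd * (b + s) / 2.0 - p)
        for p, b, s in zip(slf, slf_best, srd_inj)
    ]

def attack_power(A, eps, itr):
    """Attack power ATK, decreasing linearly with the iteration count."""
    return A * (1.0 - 2.0 * itr * eps)

def update_sardine(srd, slf_best, atk, rng=random):
    """Move one sardine relative to the elite sailfish and the attack power."""
    rnd = rng.random()
    return [rnd * (b - p + atk) for p, b in zip(srd, slf_best)]
```

For example, with 10 sailfish and 30 sardines `prey_density(10, 30)` evaluates to 0.75, and the attack power reaches zero once `2 * itr * eps` equals 1, which is the point where sardine moves stop being driven by the attack term.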

Methodology of the proposed DESFO
Improving the accuracy of classifiers involves focusing on pertinent features. Some recent research studies 1,60 suggest utilizing feature selection (FS) to replace a sizable quantity of insignificant features with a more concise and applicable subset of features. FS categorizes features as essential or non-essential, marking them as 1 or 0. This paper presents a hybrid algorithm named DESFO, which combines two algorithms, differential evolution (DE) and the sailfish optimizer (SFO), for implementing FS. The algorithm comprises several stages: initialization, position updating, binary conversion, exploration optimization via a new strategy, and exploitation optimization.
Table 2 displays the number of iterations allocated for each algorithm, which is 100.For the proposed algorithm, DESFO, this number was distributed equally between DE and SFO, with 50 iterations each.DE optimized the first 50 iterations to obtain the optimal solution, which was then passed on to SFO to enhance selected relevant features and achieve the best classification accuracy.The following sections provide detailed explanations of each of these stages.

Initial population generation
The first step in using the DESFO algorithm is generating an initial population of X positions representing potential solutions in a D-dimensional space. The population size is determined using a specific formula.
X signifies the overall number of positions, while D represents the problem's dimensionality. The position matrix is defined as:
$$M = \begin{bmatrix} M_{1,1} & M_{1,2} & \cdots & M_{1,D} \\ M_{2,1} & M_{2,2} & \cdots & M_{2,D} \\ \vdots & \vdots & \ddots & \vdots \\ M_{X,1} & M_{X,2} & \cdots & M_{X,D} \end{bmatrix}$$
The i-th solution is the i-th row of M, and $M_{i,j}$ is its j-th component. M, the initial population, is generated randomly within predefined lower and upper bounds.
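A minimal sketch of such an initialization is given below. It assumes uniform random sampling between per-dimension lower and upper bounds, which is the usual choice in population-based optimizers but is not confirmed by the text; the name `init_population` and its parameters are hypothetical.

```python
import random

def init_population(X, D, lower, upper, rng=random):
    """Return an X-by-D matrix of positions sampled inside [lower, upper].

    lower and upper are assumed to be length-D sequences of bounds.
    """
    return [
        [lower[j] + rng.random() * (upper[j] - lower[j]) for j in range(D)]
        for _ in range(X)
    ]
```

Each row of the returned matrix is one candidate solution, so a call such as `init_population(30, 13, [0.0] * 13, [1.0] * 13)` would produce 30 agents for a dataset with 13 features.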

Position update in DESFO
Updating the position involves using the equations of DE and SFO as described in subsections 3.1 and 3.2.After updating the position, it goes through binary conversion, as explained in Subsection 4.3.The fitness function then assesses the binary-transformed vector to calculate the classification error while keeping the original format of the vector for future updates.

Position binary conversions
Converting the search agents' position values from continuous to binary is necessary before assessing their fitness for FS. This is because DESFO produces continuous position values, whereas FS is a binary problem, so the continuous positions cannot be applied to it directly. The FS method uses a vector of binary values in which the selected features are represented by 1s and the non-selected features by 0s. The length of the solution vector is equivalent to the count of features in the original dataset.
A transfer function (TF) suggested by Fang et al. 61 has been utilized in the proposed algorithm; it has a V-shaped curve and is known for its exceptional global search capability. The position value obtained is represented by y, and a DESFO position is considered to have a valid TF output where α is less than 0.64 and falls within the range of [0, 1]. The update rule for DESFO's binary position is defined from this TF output.
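The paper's exact transfer function is not reproduced here. As an illustration of how a V-shaped TF turns continuous positions into a feature mask, the sketch below uses |tanh(x)|, a widely used V-shaped function, together with a fixed threshold. Both the function choice and the `threshold=0.5` default are assumptions for this example and should not be read as the rule used in DESFO.

```python
import math

def v_transfer(x):
    """A common V-shaped transfer function: maps any real x into [0, 1)."""
    return abs(math.tanh(x))

def binarize(position, threshold=0.5):
    """Mark a feature as selected (1) when its TF output reaches the threshold."""
    return [1 if v_transfer(x) >= threshold else 0 for x in position]
```

Because the curve is symmetric around zero, large positive and large negative position values both map close to 1, while values near zero map close to 0; this symmetry is what distinguishes V-shaped from S-shaped (sigmoid) transfer functions.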

Fitness evaluation
The DESFO framework and a new FS-based technique incorporate k-NN and RF as evaluative mechanisms.The k-NN method 62 selects the most common class among the closest neighbors to predict the classification of new instances.On the other hand, the RF, explained in 44 , uses decision trees to recursively divide the training data into small sets, which helps optimize the classification task by using an impurity criterion such as information gain or "gini" index 63 .These classifiers are particularly efficient in handling high-dimensional data and require minimal computational effort, as stated in 62 .
Achieving the right balance between accuracy and feature set size is crucial in DESFO. While opting for a smaller feature set can improve the precision of classifiers such as k-NN and RF, it may also compromise accuracy due to the reduced feature set 64. The relationship between the size of the feature set and the preferred features is inversely proportional, which means there is a potential trade-off between accuracy and feature set size. Therefore, the PMBH method is vital in balancing feature selection and classification accuracy 65.
When assessing the effectiveness of an algorithm, it is essential to consider the trade-off between precision and feature size. This trade-off can be mathematically represented as:
$$Fitness = \alpha_1 \times Err + \alpha_2 \times \frac{|D^*|}{|D|}$$
In the given equation, Err is the classification error, and there are two weight coefficients, α1 and α2, where α1 is a value between 0 and 1, and α2 is determined by subtracting α1 from 1. These values have been determined through extensive testing, as mentioned in reference, and the expression $|D^*|/|D|$ represents the ratio of the selected features to the total number of features in the original dataset. The main objective of this design is to increase precision while reducing the length of the feature set, as suggested in reference 38. The value |D*| represents the size of the selected feature set, while |D| represents the total number of features in the original dataset.
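The weighted trade-off described in this paragraph can be sketched as a small function. It assumes that the fitness is minimized and combines the classification error with the selected-feature ratio; the default `alpha1=0.99` is a value commonly used in wrapper feature selection studies and is an assumption here, as are the function and parameter names.

```python
def fs_fitness(error_rate, n_selected, n_total, alpha1=0.99):
    """Weighted sum of classification error and selected-feature ratio.

    error_rate is assumed to come from a classifier such as k-NN or RF
    evaluated on the selected features; lower fitness is better.
    """
    if not 0.0 <= alpha1 <= 1.0:
        raise ValueError("alpha1 must lie in [0, 1]")
    alpha2 = 1.0 - alpha1
    return alpha1 * error_rate + alpha2 * (n_selected / n_total)
```

With equal error rates, the subset with fewer features receives the lower (better) fitness, which is how the second term nudges the search toward compact feature sets without overriding accuracy.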

Improving exploration
Search agents tend to explore outside their assigned search areas to find optimal solutions. However, issues may arise when using boundary-handling techniques to keep an agent within the initial search territory, as discussed in 61. The two primary traditional methods for boundary handling are Boundary and Random modes. In Boundary mode, if dimension d of a solution goes beyond the search space S, it is repositioned to the nearest boundary, either the lower bound L or the upper bound U. In Random mode, by contrast, the out-of-range dimension d is assigned a random value within S. These traditional methods, however, have limitations in fully exploring the search space. Therefore, Periodic Mode Boundary Handling (PMBH) was developed as per 61, aiming to improve the exploration phase. PMBH allows for an infinite search space for agent movement, consisting of periodic replicas of the original space S that maintain the same fitness landscape, as shown in Fig. 1.
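To make the difference between the modes concrete, the sketch below contrasts Boundary mode (clamping) with a periodic wrap that maps an out-of-range coordinate back into [L, U) as if the search space were tiled with identical copies. The modulo-based mapping is the usual way to realize periodic boundaries and is an assumption about how PMBH is implemented; the function names are hypothetical.

```python
def boundary_clamp(x, lower, upper):
    """Boundary mode: move an out-of-range value to the nearest bound."""
    return max(lower, min(upper, x))

def periodic_wrap(x, lower, upper):
    """Periodic mode: fold x back into [lower, upper) via modular arithmetic."""
    span = upper - lower
    # Python's % returns a non-negative result for a positive span,
    # so values below the lower bound wrap around from the top.
    return lower + (x - lower) % span
```

For instance, with bounds [0, 1] the value 1.25 is clamped to 1.0 in Boundary mode but wrapped to 0.25 in periodic mode, so agents that overshoot do not pile up on the border.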

Exploitation optimization
This subsection describes the updated LS principles of the enhanced DESFO. These principles aim to improve the efficiency of the algorithm and ensure better exploitation by generating a fresh population with optimal positions while maintaining the essential structure. Three main principles guide the proposed approach. Firstly, to address the limitation of the original algorithm, which lacks a mechanism to recall and preserve the best solutions over iterations, a binary matrix has been introduced to store the top solutions obtained previously. Secondly, repetitive best-solution patterns resulting from binary conversion can reduce exploitation effectiveness, which can be improved by storing only distinct solutions in the binary matrix. Lastly, the LS strategy relies on identifying solutions close to the best one discovered by converting continuous positions into binary format and following a constrained normal distribution, as shown in Eq. (17).
The solution obtained through minor mutation deviates slightly from the current best because of a random factor β, which is normally distributed as N(0.0, 0.4). To find local search solutions, the optimal solution is first added to an empty set with a fixed maximum size, LS max . Then, a new solution is generated by applying Eq. (17) to the current g best , converted to binary, and assessed for fitness. If this new solution outperforms the current best, it becomes the new best solution.
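The local search loop described above can be sketched as follows. Several details are assumptions made only for illustration: the perturbation is taken to be additive, g best + β, with 0.4 treated as the standard deviation of β; positions are clipped to [0, 1]; binarization uses a simple 0.5 threshold; and fitness is minimized. The names `binarize`, `local_search`, and `evaluate` are hypothetical.

```python
import random


def binarize(position, threshold=0.5):
    """Assumed binary conversion: a feature is selected if its value
    exceeds the threshold."""
    return tuple(1 if v > threshold else 0 for v in position)


def local_search(g_best, evaluate, ls_max=10, sigma=0.4,
                 lower=0.0, upper=1.0, rng=None):
    """Perturb the current best with Gaussian noise, keep only distinct
    binary solutions, and accept a candidate if its fitness is lower."""
    rng = rng or random.Random()
    best_pos = list(g_best)
    best_bin = binarize(best_pos)
    best_fit = evaluate(best_bin)
    visited = {best_bin}  # set of distinct binary solutions seen so far
    for _ in range(ls_max):
        # beta ~ N(0, sigma) per dimension, clipped to the search domain
        cand = [min(max(v + rng.gauss(0.0, sigma), lower), upper)
                for v in best_pos]
        cand_bin = binarize(cand)
        if cand_bin in visited:
            continue  # skip repeated binary patterns
        visited.add(cand_bin)
        cand_fit = evaluate(cand_bin)
        if cand_fit < best_fit:
            best_pos, best_bin, best_fit = cand, cand_bin, cand_fit
    return best_pos, best_bin, best_fit
```

Tracking visited binary vectors in a set mirrors the idea of storing only distinct solutions, so that repeated patterns produced by the binary conversion do not consume fitness evaluations.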

Complexity analysis
In analyzing the complexity of DESFO, we can delve deeper into the computational processes involved. This includes looking at the computational demands of evaluating classifiers and the benefits of using combined methods in terms of efficiency.

2-Differential Evolution Operations:
• Mutation: For each individual, the mutation step chooses three different individuals and computes their vector difference, which amounts to a complexity of O(D) per individual. Consequently, the total complexity for the mutation step applied to all individuals is O(NP × D).
• Crossover: For each individual, every one of the D components is taken from the mutant vector with probability CR, which leads to O(NP × D).
• Selection: Evaluating and selecting the better individual between the target and trial vector typically involves fitness computation, which can be a significant factor depending on the complexity of the fitness function. Excluding the cost of the fitness evaluations themselves, the complexity is O(NP).
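The three operations above can be seen together in a minimal sketch of one generation of the classic DE/rand/1/bin scheme. This is a generic illustration rather than the exact DESFO variant; the function name `de_generation` and the default values `f=0.5` and `cr=0.9` are assumptions, and fitness is minimized.

```python
import random


def de_generation(population, fitness_fn, f=0.5, cr=0.9, rng=None):
    """One DE/rand/1/bin generation over a list of real-valued vectors.

    Requires at least four individuals so that three distinct donors
    other than the target can be chosen.
    """
    rng = rng or random.Random()
    n_pop, dim = len(population), len(population[0])
    next_population = []
    for i, target in enumerate(population):
        # Mutation: O(D) vector arithmetic on three distinct donors
        # (building the candidate index list adds O(NP) in this sketch).
        r1, r2, r3 = rng.sample([j for j in range(n_pop) if j != i], 3)
        mutant = [population[r1][d] + f * (population[r2][d] - population[r3][d])
                  for d in range(dim)]
        # Binomial crossover: O(D); j_rand guarantees one mutant component.
        j_rand = rng.randrange(dim)
        trial = [mutant[d] if (rng.random() < cr or d == j_rand) else target[d]
                 for d in range(dim)]
        # Selection: one comparison per individual, plus fitness evaluations.
        if fitness_fn(trial) <= fitness_fn(target):
            next_population.append(trial)
        else:
            next_population.append(list(target))
    return next_population
```

Because selection is greedy, no individual in the returned population has a worse fitness than the one it replaces; the dominant cost in practice is the pair of fitness evaluations per individual, which in feature selection means training and testing a classifier.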

Experimental results and analysis
The following part of the paper presents the results obtained by the proposed DESFO algorithm and compares them with those reported in prior studies. To verify the proposed algorithm, 14 multi-scale benchmarks were utilized, and the mean values of the results serve as evaluation metrics. The datasets employed in all experiments are described in subsection 5.1. The main parameters of the metaheuristic techniques used in this paper are outlined in subsection 5.2, and the evaluation measures are explained in subsection 5.3. In subsection 5.4, the proposed DESFO algorithm is evaluated and compared with the k-NN and RF algorithms to investigate their respective results. Subsection 5.5 compares the outcomes of the suggested DESFO algorithm with those of other methods, and convergence graphs are depicted in subsection 5.6. In subsection 5.7, Wilcoxon's test is conducted to assess the credibility of differences in fitness rates between the proposed DESFO algorithm and its counterparts. Finally, subsection 5.8 discusses the results.

Benchmarks description
The proposed algorithm's performance is demonstrated using 14 benchmarks from multiple domains that differ in their numbers of features and instances. These benchmarks are obtained from the UCI machine learning repository 57 . The variety of attributes and instances across the benchmarks is beneficial in validating the proposed algorithm. Table 1 provides an overview of the benchmarks used in this paper, along with their respective properties and descriptions. The datasets shown in Table 1 are sorted in descending order according to the number of features.

Parameters configuration
The DESFO algorithm proposed in this study was evaluated against several meta-heuristic algorithms, including the two original algorithms that were combined, the Differential Evolution (DE) algorithm 30 and the sailfish optimization (SFO) algorithm 22 , as well as nine other algorithms: Harris Hawks Optimization (HHO) 66 , Particle Swarm Optimization (PSO) 67 , Bat Algorithm (BA) 17 , Whale Optimization Algorithm (WOA) 68 , Grasshopper Optimization Algorithm (GOA) 69 , Grey Wolf Optimization (GWO) 18 , Bird Swarm Algorithm (BSA) 70 , Henry gas solubility optimization (HGSO) 71 , and Artificial Bee Colony (ABC) 11 .
In this work, the primary parameters of the ML classifiers have been established as follows. The k-NN classifier uses the Euclidean distance metric with k set to 5, a value based on the outcomes obtained in previous papers, such as 72 . The Random forest (RF) classifier 73 is a popular machine-learning algorithm often used for complex tasks such as time-series forecasting, image classification, facial expression recognition, action recognition and detection, visual tracking, label distribution learning, and more.
Every method is evaluated on each dataset by conducting 30 distinct experiments, and the results are reported as mean performance measures. To keep the evaluation fair, each method had a population size of 10 and a maximum of 100 iterations. The size of the datasets used was proportional to the complexity of the problem. The continuous search space was bounded so that exploration remained confined yet extensive. A validation process is necessary to assess the optimality of the outcomes in the framework, so a tenfold cross-validation method is employed to ensure that the values obtained are reliable. The benchmark is randomly split into two subsets, with 80% of the benchmark used for training and the remainder for testing 3 . During the learning process of the machine learning classifier, the training subset is used and optimized, while the test subset is used to evaluate the selected features. Table 2 displays the standard configurations for all techniques and the parameter settings for each method, which were determined based on the original variants and the data included in their initial publications. The experiments were run in Python on a computer equipped with an Intel i7 CPU, 16 GB of RAM, and an NVIDIA GTX 1050i GPU.

Metrics of performance
The performance of the DESFO algorithm is compared to that of other methods, and each approach is assessed independently in 30 runs per benchmark. The evaluation of the FS strategy employs the following measures.
Mean accuracy: The rate of correct data classification (Mean acc ) is determined by executing the method independently for 30 runs:
$$\mathrm{Mean}_{acc} = \frac{1}{30}\sum_{k=1}^{30}\frac{1}{m}\sum_{r=1}^{m}\mathrm{match}(PL_r, AL_r)$$
where m denotes the number of samples in the testing subset, PL r denotes the predicted class label for sample r, and AL r denotes the reference class label. The function match(PL r , AL r ) compares these labels: when PL r is equal to AL r , the value of match(PL r , AL r ) is 1; otherwise, it is 0.

Mean fitness value:
The metric (Mean Fit ) measures the average fitness achieved by the proposed approach over 30 independent runs. It reflects how decreasing the number of chosen features can be combined with a lower classification error rate, as per Eq. (16). The best result is indicated by the minimum value, and the metric is evaluated as:
$$\mathrm{Mean}_{Fit} = \frac{1}{30}\sum_{k=1}^{30} f_k^{*}$$
where Mean Fit denotes the mean fitness value and f k * indicates the best fitness value attained in the k-th of the 30 runs.
The mean number of features selected: This metric, denoted Mean Feat , represents the average count of chosen features obtained by performing the technique independently for 30 runs and is defined as:
$$\mathrm{Mean}_{Feat} = \frac{1}{30}\sum_{k=1}^{30}\frac{|d_k^{*}|}{|D|}$$
where |d k *| denotes the number of selected features in the optimal solution of the k-th of the thirty runs, while |D| denotes the total number of features in the benchmark.
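The three metrics described above reduce to simple averages over the independent runs, as the following sketch illustrates. The function names and the input layout (one entry per run) are assumptions made for this example; the number of runs is taken from the length of the input rather than fixed at 30.

```python
def match(predicted_label, actual_label):
    """Return 1 when the predicted label equals the reference label, else 0."""
    return 1 if predicted_label == actual_label else 0


def mean_accuracy(runs):
    """Average test accuracy over runs.

    `runs` is a list of (predicted_labels, actual_labels) pairs, one per run.
    """
    per_run = [sum(match(p, a) for p, a in zip(pred, act)) / len(act)
               for pred, act in runs]
    return sum(per_run) / len(per_run)


def mean_fitness(best_fitness_per_run):
    """Average of the best fitness value found in each run."""
    return sum(best_fitness_per_run) / len(best_fitness_per_run)


def mean_feature_ratio(selected_counts, n_total_features):
    """Average ratio of selected features to all features over the runs."""
    ratios = [count / n_total_features for count in selected_counts]
    return sum(ratios) / len(ratios)
```

For instance, two runs with test accuracies of 0.75 and 1.0 give a mean accuracy of 0.875, and two runs selecting 2 and 4 of 8 features give a mean feature ratio of 0.375.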

• Wilcoxon's rank-sum test:
To gain deeper insight into the significance of the method discussed, statistical evidence must demonstrate its effectiveness. Therefore, the results derived from the methods used are often validated by employing the non-parametric Wilcoxon rank-sum test. This test is favored for its ability to statistically distinguish the significance and dependability of various competing methods 74 . In this study, the focus is on evaluating the proposed DESFO method in comparison with other algorithms. The null hypothesis states that there is no difference in performance between the DESFO algorithm and each of the others when compared pairwise. The alternative hypothesis is that the DESFO algorithm outperforms the competing algorithm. The assessment hinges on the calculation of a p-value through the Wilcoxon rank-sum test, which helps analyze the differences in outcomes from 30 separate executions of both the DESFO and competing algorithms.

The results of ML classifiers (k-NN and RF) and DESFO
This subsection compares the presented ML classifiers (RF and k-NN) with the proposed methods (DESFO-RF and DESFO-K-NN) in terms of the mean accuracy (Mean acc ) and the mean number of selected features (Mean Feat ). This was done to evaluate the effectiveness and scope of the DESFO approach.

Comparisons of DESFO-K-NN and K-NN
Table 3 compares the DESFO-K-NN technique with the basic K-NN algorithm. The evaluation is centered on two performance metrics: the average classification accuracy (Mean Acc ) and the average count of selected features (Mean Feat ). According to Table 3, the DESFO-K-NN technique led to an increase in Mean Acc on all benchmarks. The increase was more than 15% on four of them. Moreover, Mean Acc exceeded 93% on nine of the fourteen benchmarks and reached 100% on four of them. Mean Feat decreased on 93% of the benchmarks as a result of applying the DESFO-K-NN method; the only exception was the Tic-tac-toe benchmark, where DESFO-K-NN could not improve Mean Feat . Overall, the DESFO-K-NN technique outperformed the basic K-NN in terms of Mean Acc and Mean Feat on most of the benchmarks, and it has shown promising results in feature selection compared to the basic k-NN on the chosen datasets.

Comparisons of DESFO-RF and RF
Table 4 compares the DESFO-RF algorithm with the basic RF algorithm. The comparison is based on two performance metrics: the mean classification accuracy (Mean Acc ) and the mean number of chosen features (Mean Feat ).
According to Table 4, the DESFO-RF technique led to an increase in Mean Acc on 93% of all benchmarks. The increase was more than 15% on four of them. Moreover, Mean Acc exceeded 92% on nine of the fourteen benchmarks and reached 100% on three of them. DESFO-RF and the basic RF are equal in accuracy on one benchmark, WineEW. Mean Feat decreased on 100% of the benchmarks as a result of applying the DESFO-RF method. Finally, it was found that the DESFO-RF method outperformed the original RF algorithm.

DESFO results versus other MH algorithms
To assess the effectiveness of DESFO in its two variants, DESFO-RF and DESFO-K-NN, which rely on the RF and k-NN classifiers, respectively, a comparison was made between DESFO and other meta-heuristic methods such as DE, SFO, ABC, PSO, BA, GWO, WOA, GOA, HHO, BSA, and HGSO, all of which were run under identical conditions. The comparison results were measured in terms of mean fitness value (Mean Fit ), mean accuracy (Mean Acc ), and mean number of features selected (Mean Feat ).

Comparisons based on the RF classifier
Table 5 presents the fitness values obtained by the proposed DESFO-RF meta-heuristic optimization algorithm, compared with those of other advanced optimization techniques in addressing the FS issue. Table 5 shows that DESFO-RF performed better than the other methods. In the FS problem, it achieved the best value on 8 benchmarks and the same value as the best competitor on 2 benchmarks. It therefore matched or exceeded the other methods on 10 of the 14 benchmarks, equivalent to 71% of all the benchmarks. Furthermore, the benchmark suite employed in this research comprises benchmarks of varying sizes, demonstrating the ability of DESFO-RF to deliver consistent performance across the entire range of benchmarks, regardless of their size. DESFO-RF did not achieve the best result on 4 benchmarks, but its mean fitness values on these benchmarks were close to those of SFO and ABC. This indicates that DESFO-RF has better outcomes than its competitors. DESFO-RF ranked ahead of every competitor except SFO on all benchmarks, which provides further evidence of the effectiveness of the proposed method over competing techniques.
Table 6 compares the mean classification accuracy of the presented DESFO-RF with that of other advanced metaheuristic optimization algorithms in tackling the FS issue. According to Table 6, the DESFO-RF approach achieved a better mean accuracy than all other methods on seven benchmarks. Moreover, it delivered results equivalent to those of other methods on five benchmarks and was outperformed on two benchmarks. Overall, the DESFO-RF approach matched or exceeded the other methods on 12 out of 14 benchmarks, equivalent to 85.7% of all the benchmarks. It is also worth noting that the SFO method was ranked second on several benchmarks. It showed a slight improvement of 0.0034% on the Lymphography benchmark and 0.0020% on the M-of-n benchmark while achieving the same score as the top performer on five other benchmarks.
Table 7 compares the mean number of selected features between the DESFO-RF method and other popular meta-heuristic optimization algorithms commonly used for the feature selection (FS) strategy. Table 7 shows that DESFO-RF and SFO produce similar results regarding the number of selected features, and both outperform the other algorithms. These two techniques won on two benchmarks and tied on three benchmarks, surpassing the other algorithms: DE, ABC, PSO, BA, GWO, WOA, GOA, HHO, BSA, and HGSO. However, it is important to note that this does not imply a tie in classification accuracy between DESFO and SFO, where DESFO has demonstrated superiority over the other algorithms. Furthermore, it should be kept in mind that choosing the smallest number of features may negatively impact classification accuracy.

Comparisons based on the K-NN classifier
Table 8 compares the average fitness values of the proposed DESFO-K-NN with those of other advanced MH optimization algorithms in addressing the FS problem. According to Table 8, DESFO-K-NN outperformed all other methods on 9 benchmarks and tied on 2 benchmarks in the FS problem. This indicates that DESFO-K-NN matched or exceeded the other methods on 11 out of 14 benchmarks, accounting for 85.7% of all benchmarks. Additionally, the study employed a suite of both large and small-scale benchmarks, indicating that DESFO-K-NN can deliver consistent performance across the entire range of benchmarks, irrespective of their size.
According to Table 9, which reports the mean accuracy, DESFO-K-NN outperformed all other methods on seven benchmarks. On the remaining seven benchmarks, its results were similar to those achieved by the other methods. DESFO-K-NN therefore matched or exceeded the other methods on all 14 benchmarks, accounting for 100% of all benchmarks, which is a remarkable improvement compared to other methods.
Table 10 compares the mean number of selected features between the DESFO-K-NN method and other established meta-heuristic optimization algorithms. This comparison helps us understand the effectiveness of the DESFO-K-NN method in addressing the FS strategy. Based on the results shown in Table 10, it can be inferred that the DESFO-K-NN algorithm has better exploration capabilities than the other algorithms, as it has the lowest mean number of selected features among all the algorithms tested (winning in 5 out of 7 cases and tying in 2 cases). This performance is superior to that of the DE, PSO, GWO, GOA, BSA, and HGSO algorithms. Even though SFO selected fewer irrelevant features than DESFO-K-NN and other methods on a few benchmarks (Lymphography, Vote, and Zoo) and achieved the same performance as DESFO-K-NN on two benchmarks (WineEW and BreastCancer), it did not outperform DESFO-K-NN in terms of mean accuracy. It is important to note that selecting a minimal number of features for classification can harm accuracy. The DESFO-K-NN algorithm has been proposed to efficiently identify the pertinent attributes and reduce the feature search area without compromising the classification accuracy. The algorithm achieves its results by discarding insignificant search areas and concentrating on the most viable ones.

Analysis and visualization
An analysis of DESFO-RF and DESFO-K-NN, used for handling the FS strategy, has been performed in this section using asymptotic analysis. To validate their convergence capabilities, the proposed techniques were applied to 14 widely used benchmark datasets, and their performance was compared against that of their peers under identical conditions, including the iteration number and population size. Figures 3 and 4 demonstrate the convergence ability of these methods in comparison to their counterparts. Based on the results depicted in Fig. 3, the DESFO-RF approach showcases rapid yet effective convergence across eight benchmarks: PenglungEW, IonosphereEW, SonarEW, WaveformEW, KrVsKpEW, BreastEW, Zoo, and Exactly2. On the other hand, Fig. 4 highlights that the DESFO-K-NN model outperforms the competition on the following benchmarks: PenglungEW, IonosphereEW, SonarEW, WaveformEW, KrVsKpEW, BreastEW, Lymphography, and Exactly2. Both proposed algorithms (DESFO-RF and DESFO-K-NN) balance exploration and exploitation, ensuring the timely acquisition of the optimal solution.
Figures 5, 6, and 7 show the performance of DESFO and other methods regarding mean fitness function values with RF and K-NN. The box plots with swarm plots in Figs. 5 and 6 show the superiority of DESFO over other algorithms. The plots reveal no outliers for either DESFO-RF or DESFO-K-NN, unlike the DE, PSO, and HGSO algorithms. The swarm plots demonstrate that most values lie within the interquartile range (IQR) of the box plot. Figure 7 shows the KDE plots, demonstrating the performance of DESFO and the other algorithms on the 14 UCI benchmarks.
Figures 8, 9, and 10 show the performance of DESFO and other methods regarding mean classification accuracy with RF and K-NN. Figures 8 and 9 illustrate the box plots with swarm plots, highlighting the superior performance of DESFO over other algorithms. A noticeable observation from the plots is that no outliers exist for DESFO-RF and DESFO-K-NN, unlike other algorithms such as DE, PSO, BA, BSA, GOA, and HGSO. The swarm plots indicate that for DESFO with RF and K-NN, most of the values are located between the interquartile range (IQR) and the maximum value of the box plot. Additionally, Fig. 10 shows KDE plots that depict the performance of DESFO and other algorithms on the 14 UCI benchmarks.

Wilcoxon's analysis
The statistical significance of the analysis can be observed in Tables 11 and 12, where the Wilcoxon test was conducted as a pair-wise assessment. This test helped to determine whether there was a significant difference between the fitness results achieved by the proposed DESFO algorithm and its counterparts 74 .
The Wilcoxon test is a statistical test often used in hypothesis testing situations. The test involves ranking the differences between the results of two paired algorithms on a set of problems. The calculation of ranks is based on the absolute values of the differences. Next, the positive and negative ranks are summed separately as R + and R − . The smaller sum of the two is recorded. If the significance level of the recorded results is less than 5%, then the null hypothesis is rejected. On the other hand, if the significance level is greater than 5%, then the null hypothesis is not rejected.
After analyzing the data presented in Tables 11 and 12, it can be concluded that the DESFO-RF and DESFO-K-NN algorithms outperformed all other algorithms in all the tested scenarios. In Tables 11 and 12, the indicated p-values are below 5%, implying that the proposed method's results are statistically significant. This strong evidence against the null hypothesis suggests that the outcomes obtained are not due to chance.
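The ranking procedure described in this subsection (ranking absolute differences and summing the positive and negative ranks as R + and R − ) corresponds to the paired, signed-rank form of the Wilcoxon test. The sketch below computes only these rank sums and is an illustration under assumptions: the function name is hypothetical, zero differences are discarded, tied absolute differences receive their average rank, and no p-value is computed. In practice, a statistics library such as SciPy (`scipy.stats.wilcoxon` for the signed-rank test, `scipy.stats.ranksums` for the rank-sum test) would be used to obtain the p-value.

```python
def signed_rank_sums(results_a, results_b):
    """Return (R_plus, R_minus, smaller_sum) for paired results.

    Differences are a - b; zero differences are dropped and tied
    absolute differences share their average rank.
    """
    diffs = [a - b for a, b in zip(results_a, results_b) if a != b]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        # Find the block of equal absolute differences starting at i.
        j = i
        while (j + 1 < len(order)
               and abs(diffs[order[j + 1]]) == abs(diffs[order[i]])):
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    r_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    r_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return r_plus, r_minus, min(r_plus, r_minus)
```

When fitness is minimized and `results_a` holds the DESFO values, a small R + relative to R − indicates that DESFO obtained the lower fitness on most paired comparisons.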

Discussion
According to the results of the empirical analysis, the DESFO algorithm stands out among recent algorithms in terms of its reliability in feature selection for classification tasks. This algorithm makes use of k-NN and RF classifiers. Among all the benchmarks, DESFO-K-NN produced the best results in terms of mean accuracy, followed by DESFO-RF. Additionally, the DESFO optimizer demonstrated a more pronounced exploration and exploitation behavior than its counterparts. On the other hand, the DESFO method exhibits a limitation in that it selects more features than its competitors across various datasets. Specifically, when compared with other

• Complexity Breakdown by Component
1-Initialization: Initializes NP individuals, each possessing D features. This operation has a complexity of O(NP × D).

Figure 3. The convergence graphs comparing the suggested DESFO approach with other methods using the RF Classifier.

Figure 4. The convergence graphs comparing the suggested DESFO approach with other methods using the K-NN Classifier.

Figure 7. KDE plot diagram of the performance of DESFO and other algorithms in terms of fitness value.

Figure 8. Box and swarm plot of the performance of DESFO-RF and other algorithms in terms of classification accuracy.
The DESFO-RF method outperformed the original RF algorithm in terms of Mean Acc on most of the benchmarks and in terms of Mean Feat . The suggested DESFO-RF approach has shown promising results in feature selection compared to the basic RF on the chosen benchmarks.

Table 3. Comparison of Mean Acc and Mean Feat for DESFO-K-NN and the basic K-NN. Superior values are in [bold].

Table 4. Comparison of Mean Acc and Mean Feat for DESFO-RF and the basic RF. Superior values are in [bold].

Table 5. Results comparison of the mean fitness value (Mean Fit ) based on the RF classifier for DESFO with other MH methods. Superior values are in [bold].

Table 6. Results comparison of the mean accuracy (Mean Acc ) based on the RF classifier for DESFO with other MH methods. Superior values are in [bold].

Table 7. Results comparison of the mean number of features selected (Mean Feat ) based on the RF classifier for DESFO with other MH methods. Superior values are in [bold].

Table 8. Results comparison of the mean fitness value (Mean Fit ) based on the K-NN classifier for DESFO with other MH methods. Superior values are in [bold].

Table 9. Results comparison of the mean accuracy (Mean Acc ) based on the K-NN classifier for DESFO with other MH methods. Superior values are in [bold].

Table 10. Results comparison of the mean number of features selected (Mean Feat ) based on the K-NN classifier for DESFO with other MH methods. Superior values are in [bold].